Throughput Performance When Applying Deblocking Filters On Reconstructed Image Frames

ABSTRACT

Throughput performance is improved when applying deblocking filters on reconstructed image frames. In one embodiment, an image frame received in the form of a set of values in encoded format is decoded to form a second set of values representing a reconstruction of the image frame in a decoded format. The specific ones of the pairs of edges (formed by sub-blocks in the image frame) to which a deblocking filter is to be applied are then determined by evaluating any pre-conditions that need to be satisfied according to a standard. The deblocking filter is then applied to the determined specific ones of the pairs of edges, with the application being performed after the determination.

RELATED APPLICATION(S)

The present application claims priority from co-pending U.S. provisional application Ser. No. 60/941,881, entitled “Deblocking Filter Implementation on VLIW Architectures for H.264 Video” filed on 30 Apr. 2007, naming the same applicant Texas Instruments Inc (the intended assignee) and the same inventors Anurag Mithalal Jain, Vipulkumar Parasottambhai Paladiya, and Sunand Mittal as in the subject application, attorney docket number: TI-60039PS, and is incorporated in its entirety herewith.

BACKGROUND

1. Field of Disclosure

The present disclosure relates generally to data compression/decompression technologies, and more specifically to improving throughput performance when applying deblocking filters on reconstructed image frames.

2. Related Art

Image frames are often required to be reconstructed from corresponding compressed/encoded data. Reconstruction refers to forming the uncompressed data, which is as close as possible to the original data from which the compressed/encoded data is formed.

For example, data representing a sequence of image frames generated from a video signal capturing a scene of interest is often provided in a compressed/encoded form, typically for reducing storage space or for reducing transmission bandwidth requirements. Such a technique may necessitate the reconstruction of the scene of interest (the sequence of image frames) by uncompressing/decoding the provided data.

H.264 is an example of a standard by which image frames are represented in a compressed form (thereby necessitating reconstruction). H.264 is described in further detail in “Information technology—Coding of audio-visual objects—Part 10: Advanced Video Coding”, available from ISO/IEC (International Standards Organization/International Electrotechnical Commission).

Deblocking filters are often applied on reconstructed image frames. As is well known, compression/decompression techniques are often “lossy” which could lead to undesirable visual characteristics in the display of reconstructed image frames, and applying the deblocking filters makes the display of reconstructed image frames less objectionable to the human eye.

For example, image frames reconstructed from data compressed/encoded at low bit rates using the H.264 standard noted above may exhibit blockiness (block edges) and/or color transitions due to the underlying compression/decompression techniques. By applying deblocking filters, at least in the case of the H.264 standard, the eventual display of images can be made less objectionable to the human eye.

Application of a deblocking filter generally requires substantial computational time/resources. As such, it may be desirable that throughput performance be improved when applying deblocking filters to reconstructed image frames.

BRIEF DESCRIPTION OF THE DRAWINGS

Example embodiments will be described with reference to the following accompanying drawings, which are described briefly below.

FIG. 1 is a block diagram illustrating an example environment in which several features of the present invention may be implemented.

FIG. 2A is a block diagram of the internal details of an H.264 encoder, illustrating an example embodiment in which several features of the present invention are implemented.

FIG. 2B is a block diagram of the internal details of an H.264 decoder, illustrating an example embodiment in which several features of the present invention are implemented.

FIG. 2C depicts the manner in which image frames are compressed/encoded using a block-based compression/encoding technique in one embodiment.

FIGS. 3A, 3B, and 3C together illustrate the manner in which deblocking filters are applied to a reconstructed macro-block (corresponding to block 290) in one embodiment in the context of H.264.

FIG. 4A is a block diagram illustrating the details of processing unit 150A in an embodiment.

FIG. 4B is a block diagram of a processing environment containing multiple execution units, each potentially implementing a pipelined architecture in one embodiment.

FIG. 4C depicts the manner in which machine instructions (executable code) may be generated in one embodiment.

FIG. 5 is a flowchart illustrating the manner in which a deblocking filter is applied with enhanced parallelism according to an aspect of the present invention.

FIG. 6 is a flowchart illustrating the manner in which the enhanced parallelism is obtained in application of deblocking filters in one embodiment of the present invention.

FIG. 7 depicts a bit field indicating both the vertical and horizontal edges of a macro block to which a deblocking filter is to be applied in one embodiment.

FIGS. 8A and 8B together illustrate the dependencies in the application of deblocking filter to the edges of a macro-block in one embodiment.

FIG. 9 is a flowchart illustrating the manner in which memory dependencies in processing the edges (in one orientation) of a reconstructed block are reduced according to an aspect of the present invention.

In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.

DETAILED DESCRIPTION

1. Overview

Several features of the present invention can be used to improve the throughput performance of applying deblocking filters on reconstructed image frames. In one embodiment, a set of values representing an image frame in encoded format is received, with the image frame containing multiple macro-blocks. Each macro-block in turn contains multiple sub-blocks forming horizontal and vertical edges, with the edges including pairs of adjacent edges in the same orientation (horizontal or vertical).

The received set of values is first decoded (and/or decompressed) to form a second set of values representing a reconstruction of the image frame in a decoded format. As noted in the background section, a deblocking filter may need to be applied on the reconstructed image frames to make the display of the images less objectionable to the human eye.

According to an aspect of the present invention, the specific ones of the pairs of edges to which a deblocking filter is to be applied are determined by evaluating a set of pre-conditions that need to be satisfied according to a standard. The deblocking filter is then applied to the determined specific ones of the pairs of edges, with the application of the deblocking filter being performed after the determination.

Thus, by determining the specific edges to which the deblocking filter is to be applied, the application of the deblocking filter to the edges can be performed with enhanced parallelism, thereby improving throughput performance.

In one embodiment, the pairs of edges are adjacent edges. Further, the determination of the specific ones of the pairs of edges is performed for all the edges in one orientation (horizontal or vertical) before application of the deblocking filter to the determined pairs is performed. Such determination can further enhance the parallelism in the manner of applying the deblocking filter.

According to another aspect of the present invention, a bit field containing a set of bits is formed, with each bit indicating whether the deblocking filter is to be applied to a corresponding edge.
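Merely as an illustrative sketch, and not as the implementation of the described embodiment, the formation of such a bit field may be expressed as below. The function name build_edge_bit_field, the convention that bit position i corresponds to edge i, and the input list of per-edge decisions are all assumptions made for illustration.

```python
def build_edge_bit_field(needs_filter):
    """Pack one bit per edge into an integer.

    needs_filter is a list of booleans (an assumed input), where entry i
    states whether the pre-conditions for edge i were satisfied.
    Bit i of the result is 1 when edge i is to be filtered.
    """
    field = 0
    for i, flag in enumerate(needs_filter):
        if flag:
            field |= 1 << i
    return field


# Edges 0 and 2 are to be filtered, edge 1 is not.
example_field = build_edge_bit_field([True, False, True])
```

Forming the whole bit field first separates the decision step from the filtering step, so the filtering step no longer needs a conditional test per edge.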

According to yet another aspect of the present invention, the formed bit field is loaded into a register and then used to identify a next bit (starting from a first bit) indicating a next edge to which the deblocking filter is to be applied. The bit field in the register is then used to identify a following bit (starting from the next bit) indicating a following edge after the next edge to which said deblocking filter is to be applied.

In one embodiment, the identification is performed using an instruction which receives an offset as an input and indicates in the register a next bit position starting from the offset at which the corresponding bit equals a desired binary value (for example “1”). Accordingly, the next bit is identified by invoking the instruction with the offset equal to the bit position of the first bit and the following bit is identified by invoking the instruction with the offset equaling the bit position of the next bit in the bit field loaded into the register.
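The behavior of such an instruction may be modeled in software as sketched below, purely for illustration. The names next_set_bit and edges_to_filter, the 32-bit width, the return value of -1 when no bit is found, and the choice to search from the offset inclusively (so that the caller passes the previous position plus one) are assumptions; the actual instruction may define the offset differently.

```python
def next_set_bit(field, offset, width=32):
    """Return the lowest bit position >= offset whose bit equals 1.

    Returns -1 when no such bit exists below `width` (an assumed convention).
    """
    for pos in range(offset, width):
        if (field >> pos) & 1:
            return pos
    return -1


def edges_to_filter(field, width=32):
    """List the positions of all set bits by repeated invocation."""
    edges = []
    pos = next_set_bit(field, 0, width)
    while pos != -1:
        edges.append(pos)
        # Continue the search just past the bit that was found.
        pos = next_set_bit(field, pos + 1, width)
    return edges
```

With the bit field 0b10010010, repeated invocation visits bit positions 1, 4, and 7 and skips the edges that need no filtering.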

In an alternative embodiment, identifying the next and following bits is performed by shifting the bit field in the register by a number of positions determined by the bit position at which the next bit is present in the bit field when loaded in the register.
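The shifting alternative may be sketched as follows, again only as an illustration under stated assumptions. The name edges_by_shifting and the counter `consumed` (tracking how many bits have already been shifted out) are hypothetical, and the trailing-zero count is computed here with ordinary integer operations rather than with any particular machine instruction.

```python
def edges_by_shifting(field):
    """Find set bits by shifting the field right past each bit found."""
    edges = []
    consumed = 0  # number of bit positions already shifted out
    while field:
        # Position of the lowest set bit in the current (shifted) field.
        tz = (field & -field).bit_length() - 1
        edges.append(consumed + tz)
        # Shift out the found bit and everything below it.
        field >>= tz + 1
        consumed += tz + 1
    return edges
```

Because the field is shifted after every match, each search always starts at bit position 0 of the register, and the original edge index is recovered by adding the running count of shifted positions.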

According to one more aspect of the present invention, the number of bits in the bit field (formed according to an aspect described above) indicating that a deblocking filter is to be applied to corresponding edges is determined. Each present edge to which the deblocking filter is to be applied is then identified in a corresponding loop, with the loop being executed a number of times equal to the determined number of bits.
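A loop with a trip count known in advance may be sketched as below. This is an illustration only; the name filter_loop is hypothetical, and the count of set bits is obtained here from the binary string representation rather than from a dedicated bit-count instruction.

```python
def filter_loop(field):
    """Visit every set bit using a loop whose trip count is known up front."""
    count = bin(field).count("1")  # number of edges to be filtered
    processed = []
    pos = -1
    for _ in range(count):
        pos += 1
        # A set bit is guaranteed to remain, so this scan always terminates.
        while not (field >> pos) & 1:
            pos += 1
        processed.append(pos)  # the filter would be applied to edge `pos` here
    return count, processed
```

Knowing the number of iterations before the loop starts is what allows the loop to be scheduled without a data-dependent exit test in each iteration.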

According to an aspect of the present invention, an edge counter which indicates the number of bit positions from a first bit to a bit representing a present edge (to which a deblocking filter is to be applied) is maintained. The addresses of memory locations storing the specific ones of the second set of values (forming the reconstruction of the image frame) which is required to apply the deblocking filter to the present edge are then computed based on the edge counter.
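One possible mapping from the edge counter to memory offsets is sketched below for vertical edges. The layout is an assumption made only for illustration: luma values are taken to be stored row by row with a hypothetical stride of 16, and edges are taken to be numbered down each column of sub-blocks (four edges per column), which is consistent with v0-v3 lying on the macro-block boundary. The name vertical_edge_q0_offsets is hypothetical.

```python
SUB_BLOCK = 4  # sub-blocks are 4x4 pixels


def vertical_edge_q0_offsets(edge_counter, stride=16):
    """Return the offsets of the four q0 pixels of a vertical edge.

    Assumes edge_counter = 4 * column + row, with rows and columns counted
    in units of sub-blocks. The p0 pixel of each row is at offset - 1.
    """
    col, row = divmod(edge_counter, 4)
    x = col * SUB_BLOCK
    y0 = row * SUB_BLOCK
    return [(y0 + r) * stride + x for r in range(SUB_BLOCK)]
```

For example, under these assumptions edge counter 5 maps to the sub-block column starting at x = 4 and the pixel rows 4 through 7.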

According to another aspect of the present invention, a present edge (to which a deblocking filter is to be applied as indicated by the bit field) is processed by first loading into a set of registers, the values required as inputs to the deblocking filter from specific memory locations in a memory. The bit field is then checked to determine whether the deblocking filter is to be applied to a base edge corresponding to the present edge (also referred to as a dependent edge), where application of the deblocking filter to the base edge causes at least some of the values in the specific memory locations to be modified to corresponding new values.

The deblocking filter is then applied to the present/dependent edge using the loaded values in the set of registers if the bit field indicates that the deblocking filter is not to be applied to the base edge. Alternatively, the system waits for availability of the new values (caused by applying the deblocking filter to the base edge) before applying the deblocking filter to the present edge if the bit field indicates that the deblocking filter is to be applied to the base edge.

Further, the new values are stored in a buffer, which provides faster access than the memory, with the loaded values in the set of registers being replaced with the new values in the buffer after waiting for the new values to be available. The deblocking filter is then applied to the present edge using the replaced values in the set of registers.
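The load, check, and replace sequence described above may be illustrated with the simplified one-dimensional model below. Everything in it is an assumption for illustration: a single pixel row is used, edge e is placed between row[4*e+3] and row[4*e+4], the base edge of edge e is taken to be edge e-1 in that row, toy_filter is a hypothetical stand-in for the real filter, and waiting for the new values is not modeled because the code runs sequentially.

```python
def toy_filter(window):
    """Hypothetical stand-in filter: average the two pixels at the edge."""
    out = list(window)
    avg = (window[3] + window[4]) // 2
    out[3] = out[4] = avg
    return out


def filter_row(row, field, n_edges):
    """Filter one pixel row using early loads plus a forwarding buffer."""
    row = list(row)
    stale = list(row)  # models loads issued before earlier results are stored
    buffer = None      # new values produced by the base edge
    for e in range(n_edges):
        x = 4 * e
        if not (field >> e) & 1:
            buffer = None
            continue
        regs = stale[x:x + 8]  # early load; the p half may be out of date
        if e > 0 and (field >> (e - 1)) & 1:
            regs[:4] = buffer  # replace loaded values with the new values
        out = toy_filter(regs)
        row[x:x + 8] = out
        buffer = out[4:]       # q half here is the p half of the next edge
    return row


def filter_row_reference(row, field, n_edges):
    """Strictly sequential version used only to check the result."""
    row = list(row)
    for e in range(n_edges):
        if (field >> e) & 1:
            row[4 * e:4 * e + 8] = toy_filter(row[4 * e:4 * e + 8])
    return row
```

In this model the early load is always issued, and the bit field alone decides whether the loaded values can be used directly or must first be replaced from the buffer.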

Several aspects of the invention are described below with reference to examples for illustration. It should be understood that numerous specific details, relationships, and methods are set forth to provide a full understanding of the invention. For example, many of the functional units described in this specification have been labeled as modules/blocks in order to more particularly emphasize their implementation independence.

A module/block may be implemented as a hardware circuit containing custom very large scale integration circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors or other discrete components. A module/block may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like.

Modules/blocks may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, contain one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may contain disparate instructions stored in different locations which when joined logically together constitute the module/block and achieve the stated purpose for the module/block.

It may be appreciated that a module/block of executable code could be a single instruction or many instructions, and may even be distributed over several code segments, among different programs, and across several memory devices. Further, the functionality described with reference to a single module/block can be split across multiple modules/blocks, or alternatively the functionality described with respect to multiple modules/blocks can be combined into a single module/block (or other combination of blocks), as will be apparent to a skilled practitioner based on the disclosure provided herein.

Similarly, operational data may be identified and illustrated herein within modules and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different member disks, and may exist, at least partially, merely as electronic signals on a system or network.

Reference throughout this specification to “one embodiment”, “an embodiment”, or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment”, “in an embodiment” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

Furthermore, the described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of embodiments of the invention.

However, one skilled in the relevant art will recognize that the invention can be practiced without one or more of the specific details or with other methods, components, materials and so forth. In other instances, well-known structures, materials, or operations are not shown in detail to avoid obscuring the features of the invention. Furthermore, the features/aspects described can be practiced in various combinations, though only some of the combinations are described herein for conciseness.

2. Example Environment

FIG. 1 is a block diagram illustrating an example environment in which several features of the present invention may be implemented. The example environment is shown containing only representative systems for illustration. However, real-world environments may contain many more systems/components as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein. Implementations in such environments are also contemplated to be within the scope and spirit of various aspects of the present invention.

The diagram is shown containing end systems 110A and 110B designed/configured to communicate with each other in a video conferencing application. End system 110A is shown containing processing unit 150A, video camera 130A, and display unit 170A, while end system 110B is shown containing processing unit 150B, video camera 130B, and display unit 170B. Each component is described in detail below.

Video camera 130A captures images of a scene (a general area sought to be captured), and forwards the captured image to processing unit 150A via path 135. The captured image is forwarded in the form of corresponding image frames, with each image frame containing a set of pixel values representing the captured image when viewed as a two-dimensional area. The image frames (generally in an uncompressed format) may be forwarded from video camera 130A in any of several formats such as RGB, YUV, etc.

Processing unit 150A may compress/encode each image frame received from video camera 130A, and forward the compressed/encoded image frames via path 155 to end system 110B. Path 155 may contain various transmission paths (including networks, point-to-point lines, etc.) providing a bandwidth for transmission of the image/video data.

Alternatively, processing unit 150A may store the compressed/encoded image frames in a memory (not shown). Processing unit 150A may also receive compressed/encoded image data from end system 110B, and forward the uncompressed/decoded image data (representing the reconstructed scene) to display unit 170A via path 157 for display.

Processing unit 150B, video camera 130B and display unit 170B operate similarly to the respective components of end system 110A, and the description is not repeated for conciseness. In particular, end system 110B may reconstruct the scene by decompressing/decoding the image frames received from end system 110A and then may display the reconstructed scene on display unit 170B. Such reconstruction may be performed in both processing units 150A and 150B, according to several aspects of the present invention, as described below with examples.

Several features of the present invention are described below in the specific context of the H.264 standard. However, it should be appreciated that the features can be implemented with respect to other encoding/decoding of sequences of image frames in other contexts and/or other standards as well, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein.

3. H.264 Standard

FIG. 2A is a block diagram of the internal details of an H.264 encoder, illustrating an example embodiment in which several features of the present invention are implemented. The encoder may be implemented within processing unit 150A or externally (e.g., using custom ASICs).

Only some of the details as pertinent to the features described below are shown for conciseness. For further details of the H.264 standard, the reader is referred to the document noted in the background section. Further, though shown as separate blocks with distinct functionalities merely for illustration, the various blocks of FIGS. 2A and 2B may be implemented as more/fewer blocks, possibly with some of the functionalities merged/split into one/multiple blocks (particularly when implemented as software modules).

The block diagram is shown containing source image frame 210, reference image frame 215, encoding block 220, compression block 230, compressed/encoded bit stream 235, decoding block 240, reconstructed image frame 245, and deblocking filter 250. Each block is described in detail below.

Source image frame 210 represents one of the image frames received from video camera 130A desired to be compressed/encoded according to the H.264 standard. In one embodiment, each source image frame is encoded using a block-based compression encoding technique as described below.

FIG. 2C depicts the manner in which image frames are compressed/encoded using a block-based compression/encoding technique in one embodiment. In a block-based technique, an image frame is viewed as containing multiple blocks, with each block representing a group of adjacent pixels with a desired dimension and shape. The encoding and decoding of the image frame may then be performed based on the blocks in the image frame.

In the following description, a compressed block refers to a block after compression/encoding, while a reconstructed block refers to the (uncompressed) block generated by uncompressing/decoding a compressed block.

In the H.264 standard, each block may be chosen to be a square block of 16×16 pixels as shown for block 290. However, an image frame can be divided into square blocks of other sizes, such as 4×4 and 8×8 pixels. Further, the blocks can be of other shapes (e.g., rectangle or non-uniform shape) and/or sizes in alternative standards. Each of these blocks is hereafter referred to as a macro-block to differentiate it from the sub-blocks described in the sections below.

Accordingly, source image frame 210 (only a portion is shown there for conciseness) is shown as being divided into a number of macro-blocks (shown numbered sequentially from m1 to m99 for reference). Each macro-block represents a group of pixels which are processed together while compressing/encoding source image frame 210.

Encoding block 220 encodes the received source image frame 210 according to the H.264 standard, with the encoding being performed with respect to reference image frame 215.

Reference image frame 215 generally represents a reconstructed image frame corresponding to a previous image frame received from video camera 130A prior to (the present) source image frame 210 being compressed. Reference image frame 215 may be received from deblocking filter 250 or alternatively, in the absence of a deblocking filter, correspond to reconstructed image frame 245 generated by decoding block 240. It should be noted that reference image frame 215 may not be identical to the previous image frame due to the lossy nature of the video compression schemes.

Each macro-block (such as block 290) is encoded by first finding the difference between the values of the (16×16) pixels in the macro-block and the values of the corresponding pixels in a reference macro-block (contained in source image frame 210 or reference image frame 215). The difference between the macro-blocks is often expressed in terms of luma (representing the brightness information) and chroma (representing the color information) corresponding to the pixels.

Encoding block 220 then encodes the differences to generate a corresponding encoded macro-block data. For example, the difference may be transformed using a block transform and then quantized to generate a corresponding set of quantized transform coefficients (thereby compressing the data) representing the macro-block being encoded. Such quantization may result in “lossy” compression of the image frame, whereby some of the visual information contained in source image frame 210 is not compressed/encoded and therefore cannot be reconstructed when decoding the corresponding compressed data.
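The loss introduced by quantization can be seen in the small sketch below. It uses plain uniform scalar quantization with a hypothetical step size of 8 and made-up coefficient values; it is not the transform or quantization scheme specified by H.264, and the names quantize and dequantize are assumptions.

```python
def quantize(coeffs, step):
    """Map each coefficient to an integer level (truncating toward zero)."""
    return [int(c / step) for c in coeffs]


def dequantize(levels, step):
    """Recover approximate coefficients from the integer levels."""
    return [level * step for level in levels]


coeffs = [52, -7, 3, 0]            # illustrative transform coefficients
levels = quantize(coeffs, 8)       # what would be encoded
restored = dequantize(levels, 8)   # what the decoder can reconstruct
```

Here the small coefficients -7 and 3 both become level 0 and cannot be recovered, and 52 is restored only as 48, which is the kind of information loss that later shows up as visible block edges.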

In one embodiment, each macro-block in the image frame is encoded to generate a corresponding 16×16 luma block and two 8×8 chroma blocks. Less data is generally used for the color information (chroma) than for the brightness information (luma), since human perception is generally less sensitive to color changes than to brightness changes.

Encoding block 220 then assembles the encoded macro-block data corresponding to the macro-blocks forming source image frame 210 to form the encoded image data and forwards (makes available) the encoded image data to compression block 230.

Compression block 230 further compresses the encoded image data using entropy-encoding techniques, well known in the relevant arts. Entropy encoding may involve using fewer bits to encode more frequently occurring data in the encoded image data and using more bits to encode less frequently occurring data.

The compressed/encoded image data is then generated in the form of compressed/encoded data stream 235 (containing a set of values in encoded format), which may then either be stored or transmitted to a recipient system such as end system 110B. Compressed/encoded data stream 235 may represent the entire image frame or portions of it in a compressed/encoded form, and may include information (such as size/dimension/shape of each of the corresponding macro-blocks) to enable a device (such as processing unit 150B of FIG. 1) to decompress/decode the image frame accurately.

Decoding block 240 receives the output of encoding block 220 and decodes the encoded image data. Such decoding may be necessary to generate reference image frame 215 to be used in encoding the next image frame received from video camera 130A.

Decoding block 240 reconstructs the macro-block (in reconstructed image frame 245) from the corresponding macro-block data, as well as previously decoded macro-blocks which may be retrieved from a storage unit (not shown). Decoding block 240 may substantially perform the reverse of the corresponding operations used to compress and encode a macro-block, such as an inverse quantization and inverse transform, performed by encoding block 220.

Decoding block 240 then assembles the reconstructed macro-blocks to generate reconstructed image frame 245, which is then forwarded to deblocking filter 250. Deblocking filter 250, provided according to several aspects of the present invention, removes the visual defects in reconstructed image frame 245 to generate reference image frame 215 as described in the sections below.

It may be appreciated that a similar approach may be used in decompressing/decoding the compressed/encoded data stream 235 as described in detail below.

FIG. 2B is a block diagram of the internal details of an H.264 decoder, illustrating an example embodiment in which several features of the present invention are implemented. The decoder may be implemented within processing unit 150B or externally (e.g., using custom ASICs). Only some of the details as pertinent to the features described below are shown for conciseness.

Decompression block 260 receives the compressed/encoded image frame in the form of compressed/encoded data stream 235 and may substantially perform the reverse of the operations performed by compression block 230 to generate the encoded image data. Decompression block 260 may then forward the encoded image data to decoding block 240.

Decoding block 240 reconstructs the image frame from the encoded image data in the form of reconstructed image frame 245 (containing a set of values in a decoded format), which is then processed by deblocking filter 250 to generate displayed image frame 265. Displayed image frame 265 may be displayed on display unit 170B.

It may be appreciated that displayed image frame 265 corresponds (at least substantially) to source image frame 210 after being compressed and decompressed according to the H.264 standard. As described above, it may be necessary to apply a deblocking filter to reconstructed image frame 245. The general concepts underlying such application of the deblocking filter according to the H.264 standard are described below with examples.

4. Applying Deblocking Filters

FIGS. 3A, 3B, and 3C together illustrate the manner in which deblocking filters are applied to a reconstructed macro-block (corresponding to block 290) in one embodiment in the context of H.264. Each of the Figures is described in detail below.

According to H.264 standard, the deblocking filter is to be applied to each square block of 4×4 pixels (hereafter referred to as a sub-block) in the reconstructed image frame. As such, each reconstructed macro-block (16×16 pixels) may be viewed as containing 16 sub-blocks of 4×4 pixels. The application of the deblocking filter may then be performed in the context of the sub-blocks.
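The view of a macro-block as 16 sub-blocks may be sketched as below, for illustration only. The name split_into_sub_blocks and the raster (left to right, then top to bottom) ordering of the returned sub-blocks are assumptions and need not match the edge numbering of FIG. 3A.

```python
def split_into_sub_blocks(macro_block):
    """Split a 16x16 macro-block (list of 16 rows) into 16 sub-blocks of 4x4."""
    return [
        [row[x:x + 4] for row in macro_block[y:y + 4]]
        for y in range(0, 16, 4)
        for x in range(0, 16, 4)
    ]


# A macro-block whose pixel at (x, y) simply holds the value 16 * y + x.
example_mb = [[16 * y + x for x in range(16)] for y in range(16)]
example_subs = split_into_sub_blocks(example_mb)
```

The boundaries between horizontally adjacent sub-blocks in this list correspond to vertical edges, and the boundaries between vertically adjacent sub-blocks correspond to horizontal edges.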

FIG. 3A illustrates the order in which each of the sets of horizontal and vertical edges (formed between sub-blocks) are to be processed for deblocking. H.264 requires that the vertical edges be processed before horizontal edges and accordingly the example embodiments below are described based on that constraint. However, it should be appreciated that alternative embodiments/standards can be implemented with different order of processing of edges, as desired in specific environments, without departing from the scope and spirit of several aspects of the present invention.

FIG. 3A depicts 16 vertical edges (shown as v0-v15 in 310, i.e., edges of the same vertical orientation), and 16 horizontal edges (shown as h0-h15 in 320, i.e., edges of the same horizontal orientation) that are processed for the luma information corresponding to a macro-block (m41 or block 290).

The vertical edges v0-v15 are processed first according to the sequence numbers associated with each edge, followed by the horizontal edges h0-h15 (also according to the sequence numbers). It should be appreciated that each edge may be viewed as covering the area between the displays caused by the adjacent pixels as also shown in FIG. 3B.

Thus, vertical edge v0 between sub-blocks 312 and 315 (a sub-block in the previous macro-block m40) is the area between pixel pairs of {p0, q0}, with p0 representing the boundary pixel for sub-block 315 and q0 representing the boundary pixel for sub-block 312, as depicted in FIG. 3B. The horizontal edge h4 is the area in the boundary of sub-blocks 322 and 325 between pixel pairs of {m0, n0} as also illustrated in FIG. 3B. The remaining edges for luma information are defined similarly.

It may be further appreciated that v1, v2, v3 are respectively adjacent (of same orientation) edges to v0, v1, and v2. The remaining vertical adjacent edges are similarly described with respect to 310. Similarly, h1, h2, h3 are also respective adjacent (in the horizontal orientation) edges to h0, h1 and h2.

It may be observed in FIG. 3B that the pairs of pixels on either side of edge v0 are labeled similarly as {p0, q0}. However, the pixel pairs of {p0, q0} represent (four) different pairs of pixels and may have different values based on the encoded data. The labeling of the different pixel pairs using the same labels is merely for convenience in describing the embodiments of the present invention.

The order of processing the edges c0-c7 for the chroma information corresponding to the macro-block (m41 or block 290) may be similarly understood based on the depiction at 330 in FIG. 3A. Various features of the invention hereafter are substantially described with respect to processing of luma information for conciseness. However, the processing may be applicable to chroma information as well, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein.

FIG. 3C indicates the rules set by the H.264 standard with respect to the number of adjacent pixels to be used (“boundary strength”) as inputs to the deblocking filter while processing each edge. Column 360 specifies the conditions of the rule and column 365 specifies the corresponding boundary strength. As is well known, the boundary strength is calculated as per the process specified in the H.264 standard and is dependent on multiple data fields (such as motion vector, quantization parameter, macro-block type, etc.) decoded from compressed/encoded data stream 235.

Thus, row 371 indicates that when either of the two sub-blocks (conveniently named p and q and which may correspond to sub-blocks 322 and 325) is intra coded and the edge is a macro-block edge (e.g., edges v0-v3 and h0-h3 in FIG. 3A), the boundary strength is 4, indicating that four pixels on both sides of the edge (e.g., p0-p3 and q0-q3 in FIG. 3B) are to be used as inputs to the deblocking filter. The remaining rows 372-375 are similarly explained, with row 375 indicating the condition under which deblocking filter need not be applied (boundary strength=0).
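A rule table of this kind may be sketched as an ordered chain of tests, as below. Only the first and last cases (strengths 4 and 0) are taken from the description of rows 371 and 375 above. The intermediate cases are not reproduced in this text; they are filled in here from the commonly cited H.264 rules as an assumption and should be checked against the standard. The function name and all parameter names are hypothetical.

```python
def boundary_strength(p_intra, q_intra, is_mb_edge,
                      p_has_coeffs, q_has_coeffs,
                      same_reference, large_mv_difference):
    """Return a boundary strength from 0 to 4 for the edge between p and q."""
    if (p_intra or q_intra) and is_mb_edge:
        return 4  # row 371: intra coded and on a macro-block edge
    if p_intra or q_intra:
        return 3  # assumed: intra coded, edge inside the macro-block
    if p_has_coeffs or q_has_coeffs:
        return 2  # assumed: either sub-block contains coded coefficients
    if not same_reference or large_mv_difference:
        return 1  # assumed: different reference or motion vectors differ
    return 0      # row 375: the deblocking filter need not be applied
```

The tests are ordered from the strongest to the weakest condition, so the first matching rule determines the result.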

During operation, along with the compressed values, additional information may be received indicating the manner of encoding of each macro-block, which facilitates a determination of whether a macro-block is a candidate for application of deblocking filter (i.e., having a boundary strength greater than 0).

Even if the boundary strength is greater than 0, a decision on whether to apply the deblocking filter or not, may be based on the below equation (referred to as threshold requirements):

|p0-q0|, |p1-p0|, and |q1-q0| are each less than a threshold t1 or t2  Equation (1)

wherein, | | represents the absolute value operator, the pixels {p1, p0, q0, q1} have been determined to be used as inputs to the deblocking filter and t1 and t2 are thresholds specified by the H.264 standard (and are commonly referred to as alpha and beta thresholds). It should be appreciated that the values p0, q0, etc. of above equation need to be used after any modification (or computation of new values) by application of deblocking filters to corresponding base edges.
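Merely for illustration, the threshold requirements of Equation (1) may be sketched in C as below. The function name and parameter names are hypothetical. The sketch assumes that t1 (alpha) bounds |p0-q0| and that t2 (beta) bounds the other two differences, which is the assignment used by the H.264 standard; Equation (1) itself does not spell out which threshold applies to which difference.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 when all three differences of Equation (1) are below their
 * thresholds, and 0 otherwise.
 * Assumption: alpha (t1) bounds |p0-q0|, beta (t2) bounds |p1-p0| and
 * |q1-q0|. The pixel values must be the current ones, that is, taken
 * after any filtering of the base edge. */
int meets_thresholds(int p1, int p0, int q0, int q1, int alpha, int beta)
{
    assert(alpha >= 0 && beta >= 0);
    return abs(p0 - q0) < alpha &&
           abs(p1 - p0) < beta &&
           abs(q1 - q0) < beta;
}
```

Because the check reads pixel values that an earlier edge may have modified, it cannot be evaluated ahead of time in the same way as the boundary strength.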

Once a determination is made to apply the deblocking filter, a specific formula based on the boundary strength is applied. The formulas used for applying deblocking are not described here, as they are not relevant to an understanding of the described embodiment. However, it is sufficient to understand that as each (vertical/horizontal) edge is filtered, the pixels used for inputs are recomputed and the recomputed/output values may replace the input values. Once replaced, the new values may be used for filtering the later edges according to the sequence described above with respect to FIG. 3A.

It may be appreciated that the boundary strength and the threshold requirements represent pre-conditions that need to be satisfied for the application of the deblocking filter in the H.264 standard. However other standards may specify different/other additional pre-conditions that need to be satisfied in determining the application of the deblocking filter as will be apparent to one skilled in the relevant arts.

It should be appreciated that several features of the invention described below can be implemented in various embodiments as a desired combination of one or more of hardware, software, and firmware. The description is continued with respect to an embodiment in which various features are operative when software instructions are executed.

5. Software Implementation

FIG. 4A is a block diagram illustrating the details of processing unit 150A in an embodiment. The description below also applies to processing unit 150B.

Processing unit 150A may contain one or more processors such as central processing unit (CPU) 410, along with random access memory (RAM) 420, secondary storage unit 450, display controller 460, network interface 470, and input interface 480. All the components may communicate with each other over communication path 440, which may contain several buses as is well known in the relevant arts. The components of FIG. 4A are described below in further detail.

CPU 410 may execute instructions stored in RAM 420 to provide several features of the present invention. CPU 410 may contain multiple execution units as described below with respect to FIG. 4B, with each execution unit potentially being designed for a specific task. Alternatively, CPU 410 may contain only a single general-purpose processing unit.

RAM 420 may receive instructions from secondary storage unit 450 using communication path 440. In addition, RAM 420 may store video frames received from a video camera during the encoding operations noted above. Display controller 460 generates display signals (e.g., in RGB format) to display unit 170B (FIG. 1) based on data/instructions received from CPU 410.

Network interface 470 provides connectivity to a network (e.g., using Internet Protocol), and may be used to receive/transmit compressed/encoded video/image frames on path 155 of FIG. 1. Input interface 480 may include interfaces such as keyboard/mouse, and interface for receiving video frames from video camera 130A.

Secondary storage unit 450 may contain hard drive 456, flash memory 457, and removable storage drive 458. Some or all of the data and instructions may be provided on removable storage unit 459, and the data and instructions may be read and provided by removable storage drive 458 to CPU 410. Floppy drive, magnetic tape drive, CD-ROM drive, DVD Drive, Flash memory, removable memory chip (PCMCIA Card, EPROM) are examples of such removable storage drive 458.

Alternatively, data and instructions may be copied to RAM 420 from which CPU 410 may read and execute the instructions using the data. Removable storage unit 459 may be implemented using medium and storage format compatible with removable storage drive 458 such that removable storage drive 458 can read the data and instructions. Thus, removable storage unit 459 includes a computer readable (storage) medium having stored therein computer software and/or data.

In general, the computer (or generally, machine) readable medium refers to any medium from which processors can read and execute instructions. The medium can be randomly accessed (such as RAM 420 or flash memory 457), volatile, non-volatile, removable or non-removable, etc. While the computer readable medium is shown being provided from within processing unit 150A for illustration, it should be appreciated that the computer readable medium can be provided external to processing unit 150A as well.

In this document, the term “computer program product” is used to generally refer to removable storage unit 459 or hard disk installed in hard drive 456. These computer program products are means for providing software to CPU 410. CPU 410 may retrieve the software instructions, and execute the instructions to provide various features of the present invention described below. Groups of software instructions in any form (for example, in source/compiled/object form or post linking in a form suitable for execution by CPU 410) are termed as code.

It may be appreciated that though the H.264 standard requires the edges to be processed in a particular sequence, it may be desirable that as many computations as possible be performed in parallel. Accordingly, the edges may be processed in parallel subject to the dependency requirements caused, for example, by the boundary strengths and the need to use the recomputed values to filter the later edges.

In one embodiment described below, multiple execution units are employed to potentially process multiple edges in parallel. The manner in which the throughput performance of applying a deblocking filter may be enhanced in such an environment is described below with examples (even though various features of the present invention can be implemented in other types of environments, potentially without multiple execution units, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein).

6. Processing Environment

FIG. 4B is a block diagram of a processing environment containing multiple execution units, each potentially implementing a pipelined architecture in one embodiment. A pipelined architecture refers to an implementation technique where instructions are executed in a sequence of stages, thereby enabling multiple different instructions (or parts thereof) to be executed in parallel.

CPU 410 represents such a processing environment implementing a pipelined architecture and is shown containing instruction cache 411, instruction register 412, data registers 413, execution units 415A-415D, and data cache 417. Merely for illustration, only representative number/type of components is shown in the Figure. Many processing environments often contain many more components, both in number and type, depending on the purpose for which the processing environment is designed.

It should be appreciated that the pipelining technique and/or the multiple-execution-units are pertinent to only some of the features of the invention, as will be clear from the corresponding context. Further, the execution units can be present as different CPUs as well. Each component of FIG. 4B is described below in further detail.

Instruction cache 411 maintains machine instructions to be executed. The instructions may be loaded from a memory (such as RAM 420 via path 440) prior to commencement of execution. The instructions together represent in machine executable form, a software module designed to apply the deblocking filter to the different edges in a reconstructed macro-block.

Instruction register 412 stores the machine instruction currently being executed. During execution, each machine instruction in instruction cache 411 is loaded into instruction register 412 which then holds the software instruction while being decoded and executed by the different execution units 415A-415D.

Data registers 413 contain various registers, with each register having capabilities such as holding the input values to an instruction, storing the execution results, providing access to/from data between execution units, etc. In general, the registers provide a small amount of storage (in comparison to data cache 417 and RAM 420) while typically providing fast access to the data.

Data cache 417 represents a temporary storage storing frequently accessed data (though accessed less frequently than the data stored in data registers 413 in one embodiment). Once data is stored in data cache 417, future use can be made by accessing the cached copy rather than fetching (from memory such as RAM 420) or recomputing the original data. The data stored in data cache 417 may be periodically written to the memory.

Memory (such as RAM 420) may be viewed as containing multiple memory locations, with each location storing a corresponding data value. As such, accessing specific data values may require CPU 410 to specify the corresponding memory locations. In general, accessing the data in memory locations is slower than accessing data in data cache 417, which in turn is slower than accessing data in data registers 413.

Each of execution units 415A-415D may be designed to independently execute a corresponding given set of machine instructions together designed to perform a logical task (e.g., processing of a single edge, upon appropriate design of software module 490 and/or compiler 494, described below).

Each of the execution units may further contain functional units (representing a stage in the pipelined architecture) capable of performing corresponding specific operations, such as loading/storing data values, branching based on conditions, performing integer/floating point operations, etc. Such functional units may fetch the machine instructions being currently executed (or part thereof) from instruction register 412, decode the machine instructions, and perform the corresponding operations indicated by the machine instruction.

It may be appreciated that the throughput performance of deblocking may be enhanced by utilizing the parallelism possible by the presence of multiple execution units and pipelining features.

Several aspects of the present invention enable the parallelism to be exploited with respect to application of deblocking filters. The manner in which the software code (and consequently the resulting machine instruction) can be specified/written to implement the deblocking filters is further described with respect to an example environment supporting the architecture of FIG. 4B described above.

7. Generating Machine Instructions for Parallelism

FIG. 4C depicts the manner in which machine instructions (executable code) may be generated in one embodiment. Software module 490 represents the software code containing user instructions written by a developer. The software code may be specified in any programming language, though higher-level languages (e.g., C, C++, and Java) are generally preferred to enhance the developers' productivity.

Compiler 494 processes the software code (in the specified programming language) to generate executable code 498 containing machine instructions suitable for execution in the processing environment of FIG. 4B. Concepts such as object files, linking, target machine specification, code generation, etc., are not described in detail as not being pertinent to the concepts sought to be illustrated. Compiler 494 may be designed to exploit the parallelism possible in the processing environment (for example, the environment described above), potentially by reordering the logic without violating dependencies.

However, it is desirable that software module 490 by itself contain processing logic which lends itself to further exploitation of the parallelism in the processing environment. The manner in which the user instructions can be designed for enhanced parallelism in several contexts is described below. It should however be appreciated that the features of the invention can be realized by embedding the corresponding intelligence in compiler-type systems software as well.

According to an aspect of the present invention, such enhanced parallelism is obtained in the application of deblocking filter to edges of a reconstructed macro-block. Such a feature will be clearer in comparison to a prior approach which uses an alternative technique, and accordingly the prior approach is described briefly below.

8. Prior Approach

In one prior approach, each edge may be allocated to one of the execution units, which then determines whether the edge meets the requirements set forth with respect to FIG. 3C, and applies deblocking filter to the edge if the requirements are met.

Such an approach causes irregular branching (since the “if” condition checking whether the edge meets the pre-conditions, could fail or succeed) during processing of each edge, thereby breaking the pipelining process. The breaking of the pipeline reduces the parallelism, as is well known in the relevant arts. The reduced parallelism may in turn impede the parallelism possible across the execution units (since the dependent edges need to wait for completion of processing of the base edge).

A software module (or user instructions) designed according to several aspects of the present invention improves the throughput performance when applying deblocking filter to reconstructed image frames, while overcoming some of the disadvantages of the above prior approach.

As may be appreciated, the user instructions (or corresponding machine instructions, also stored on a machine readable medium) in turn cause CPU 410 (or the components therein) to operate in the corresponding manner. Accordingly, the design of the user instructions is described with reference to the effective operation of CPU 410 in the description below.

9. Applying Deblocking Filter with Enhanced Parallelism

FIG. 5 is a flowchart illustrating the manner in which a deblocking filter is applied with enhanced parallelism according to an aspect of the present invention. The flowchart is described with respect to FIGS. 1, 4A and 4B, merely for illustration. However, various features can be implemented in other environments and other components.

Furthermore, the steps are described in a specific sequence merely for illustration. Alternative embodiments in other environments, using other components and different sequence of steps can also be implemented without departing from the scope and spirit of several aspects of the present invention, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein. The flowchart starts in step 501, in which control passes immediately to step 520.

In step 520, CPU 410 determines the specific ones of the edges of the (reconstructed) macro-block to which a deblocking filter is to be applied based on boundary strength. In general, a set of pre-conditions for each specific edge that are to be satisfied prior to applying a deblocking filter, according to the applicable standard (H.264 in the illustrative example), may be evaluated for the determination.

In the case of H.264 standard, the determination may be performed similar to the manner described above with respect to FIG. 3C and therefore the description is not repeated in detail for conciseness. In summary, deblocking filter is to be applied for an edge if the corresponding boundary strength is greater than 0.

It should be noted that not all the pre-conditions (for application of deblocking filter) need be checked, as suited in specific environments. For example, in the example embodiment below, the threshold requirement is not checked for determining the edges to which the deblocking filter is to be applied, since the threshold requirement is based on the values of the pixels on either side of an edge, which may be modified by the application of the deblocking filter to the base edge.

In step 560, CPU 410 applies the deblocking filter to each of the determined specific edges of the macro-block in the order specified in FIG. 3A. In particular, new values for a number of adjacent pixels determined by the boundary strength of the edge being filtered may be computed as a result. The new values, along with any unchanged values of the various blocks, together represent the reconstructed frame. The flowchart ends in step 599.

Thus, CPU 410 may first determine all the specific edges of the macro-block that are to be filtered by evaluating any applicable pre-conditions as a batch and then apply the deblocking filter to only the determined edges (i.e., in case the pre-conditions are satisfied) again as a batch. This means the determination and applying deblocking filter steps are not interspersed. In an embodiment, this manifests in software code with the determination being outside of program structures such as loops which apply the deblocking filters to each of the edges.
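Merely for illustration, the separation of the determination (step 520) from the application (step 560) may be sketched in C as below. All names (select_edges, filter_selected, edge_filter_fn) are hypothetical, and the sketch assumes that the boundary strength of each of 16 edges in one orientation has already been computed into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_EDGES 16

/* Hypothetical callback that applies the deblocking filter to one edge. */
typedef void (*edge_filter_fn)(int edge, void *ctx);

/* Determination as a batch (step 520): collect the edges whose boundary
 * strength is greater than 0. The slot is written unconditionally and the
 * count advances only for selected edges, so the loop body contains no
 * data-dependent branch. Returns the number of selected edges. */
int select_edges(const uint8_t bs[NUM_EDGES], int selected[NUM_EDGES])
{
    int count = 0;
    assert(bs != NULL && selected != NULL);
    for (int e = 0; e < NUM_EDGES; e++) {
        selected[count] = e;
        count += (bs[e] > 0);
    }
    return count;
}

/* Application as a batch (step 560): the loop visits only the selected
 * edges, in increasing order, and contains no "should this edge be
 * filtered" test. */
void filter_selected(const int selected[], int count,
                     edge_filter_fn apply, void *ctx)
{
    assert(count >= 0 && count <= NUM_EDGES);
    for (int i = 0; i < count; i++)
        apply(selected[i], ctx);
}
```

With this structure, the only loop-control decision in the filtering loop is the regular trip-count test, which a compiler can typically schedule more readily than a per-edge condition.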

It may be appreciated that by determining the specific edges to be filtered prior to application of the deblocking filter, the irregular branching caused by prior approaches can also be avoided within the individual execution units applying the deblocking filter to the corresponding edge, thereby improving the performance of the deblocking filter (by increasing the parallelism in CPU 410).

Furthermore, as the determination of step 520 for each edge can be performed without dependency on other edges, it may be possible to utilize as many execution units as are available for the determination, thereby increasing the throughput performance. Several features of the present invention provide for enhanced parallelism even within execution of step 560, as described in sections below.

It should be appreciated that the flowchart of FIG. 5 can be implemented using various approaches, with corresponding advantages. The description is continued with respect to an example implementation of realizing the above noted features.

10. Example Implementation of Enhanced Parallelism

FIG. 6 is a flowchart illustrating the manner in which the enhanced parallelism is obtained in application of deblocking filters in one embodiment of the present invention. The flowchart is described with respect to FIGS. 1, 4A and 4B, merely for illustration. However, various features can be implemented in other environments and other components.

Furthermore, the steps are described in a specific sequence merely for illustration. Alternative embodiments in other environments, using other components and different sequence of steps can also be implemented without departing from the scope and spirit of several aspects of the present invention, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein.

It may be appreciated that steps of FIG. 6 are first performed for processing the vertical edges in a macro-block and then may be performed again to process the horizontal edges in the macro-block as specified by the H.264 standard. Accordingly, the steps of the flow chart are described in relation to the processing of the vertical edges, though the description is applicable to the processing of the horizontal edges as well. The flowchart starts in step 601, in which control passes immediately to step 610.

In step 610, CPU 410 generates a bit field representing the (vertical/horizontal) edges in a macro-block to which a deblocking filter is to be applied. A bit field contains a set of bits, with each bit representing a corresponding Boolean flag (having the values “true” or “false”) indicating whether the deblocking filter is to be applied to the corresponding edge.

In general, the false value may be represented as a bit value of “0” or “1” with the true value represented as a bit value of the opposite parity (“1” in the case of “0” and vice versa). In one embodiment, the true and false values are respectively represented as bit values “1” (indicating that the filter is to be applied) and “0” (indicating that the filter need not be applied).

It may be appreciated that the bit field can be formed while performing the determination of step 520 described above. The generated bit field may be loaded into a register (in data registers 413) during the processing of the edges. The generated bit field may indicate only the vertical or horizontal edges in the macro-block or a combination of both.

In one embodiment, shown in FIG. 7, the bit field indicates both the vertical and horizontal edges of a macro block to which a deblocking filter is to be applied. As described above, the value of each bit in bit field 720 indicates whether the corresponding vertical/horizontal edge is to be filtered (value 1) or not (value 0).

The bit positions of the bits in bit field 720 are indicated in 710, while the edges corresponding to the bits are indicated in 730. Accordingly, the horizontal edges (h0-h15) are represented by the left most 16 bits (in the bit positions 0-15) while the vertical edges (v0-v15) are represented by the right most 16 bits (in the bit positions 16-31).

It may be observed that the bit position corresponding to an edge indicates the position of the edge in the sequence in which all the edges in the macro-block are to be filtered (within the group of edges in each orientation), as shown in FIG. 3A. For example, bit position 5 corresponds to horizontal edge h5 which is filtered fifth according to the horizontal sequence of FIG. 3A. Similarly, bit position 23 (or 7 for the vertical orientation) corresponds to the vertical edge v7 which is filtered seventh according to the vertical sequence of FIG. 3A.

The description is continued assuming that the left most 16 bits in the bit field, representing the horizontal edges, are first extracted to generate a horizontal edge bit field before performing the below steps. The bits representing the vertical edges may be extracted to form a vertical edge bit field when processing vertical edges.
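Merely for illustration, the formation of a 32-bit bit field with the layout described for FIG. 7, and the extraction of the per-orientation bit fields, may be sketched in C as below. The function names are hypothetical. The sketch assumes that bit position 0 is the left most (most significant) bit of the word and that each extracted bit field is kept left-aligned, so that position k still refers to edge k.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: bit position 0 is the most significant bit. */
#define EDGE_BIT(pos) (0x80000000u >> (pos))

/* Sets bit k for each horizontal edge hk and bit 16+k for each vertical
 * edge vk whose boundary strength is greater than 0. */
uint32_t build_edge_bit_field(const uint8_t bs_h[16], const uint8_t bs_v[16])
{
    uint32_t field = 0;
    assert(bs_h != NULL && bs_v != NULL);
    for (int e = 0; e < 16; e++) {
        if (bs_h[e] > 0)
            field |= EDGE_BIT(e);
        if (bs_v[e] > 0)
            field |= EDGE_BIT(16 + e);
    }
    return field;
}

/* Horizontal edges occupy positions 0-15: clear the other 16 bits. */
uint32_t horizontal_edge_bits(uint32_t field)
{
    return field & 0xFFFF0000u;
}

/* Vertical edges occupy positions 16-31: move them to positions 0-15. */
uint32_t vertical_edge_bits(uint32_t field)
{
    return field << 16;
}
```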

In step 620, CPU 410 sets a variable ‘loop count’ equal to the number of edges to which the deblocking filter is to be applied. According to the convention of above, loop count would equal the number of ‘1’ in the vertical/horizontal edge bit field. Thus, for bit field 720, loop count may be set equal to 8 when processing vertical edges and to 7 when processing horizontal edges.

In one embodiment, a machine instruction is provided to count the number of ‘1’ in a register, and accordingly the vertical/horizontal edge bit field may be loaded into the register and the corresponding machine instruction may be executed to determine the value for loop count.
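Where a dedicated bit-count instruction is not available, the loop count may be derived in software. The sketch below is a portable stand-in (not the machine instruction referred to above) that clears the lowest set bit in each iteration; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the '1' bits in an edge bit field. Each iteration clears the
 * lowest set bit, so the loop runs once per edge to be filtered. */
int count_edges(uint32_t edge_bits)
{
    int count = 0;
    while (edge_bits != 0) {
        edge_bits &= edge_bits - 1;
        count++;
    }
    assert(count <= 32);
    return count;
}
```

Many compilers also expose a bit-count intrinsic (for example, __builtin_popcount in GCC), which may map to a single machine instruction on targets that provide one.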

The steps of 640-690 operate to apply the deblocking filter to each determined vertical/horizontal edge of the macro-block (to which the deblocking filter is to be applied). The loop would be executed only as many times as the number of vertical/horizontal edges determined in step 520.

In step 640, CPU 410 checks whether the loop count is greater than 0. Control passes to step 660 if loop count is greater than 0 (indicating that there is at least one edge that is to be filtered) and to step 699 otherwise. The flowchart ends in step 699 indicating that there are no more edges to be filtered.

In step 660, CPU 410 identifies an edge to be processed using the vertical/horizontal edge bit field. The edge may be identified using a single instruction or a set of instructions based on the instruction set capable of being executed by CPU 410.

In one embodiment, the identification of the edge is performed using an instruction which receives an offset as an input and indicates a next bit position starting from the offset at which the corresponding bit equals a desired binary value (“1” indicating an edge to which the deblocking filter is to be applied).

Thus, the instruction is first invoked with an offset (by convention chosen as −1) to identify the next bit position at which the corresponding bit indicates an edge to which the deblocking filter is to be applied. During the next execution of step 660, the instruction is invoked with the offset equaling the next bit position to identify a following bit (also having a value of “1”) indicating a following edge to which the deblocking filter is to be applied after applying to the next edge.

Referring to FIG. 7, the instruction is first invoked with offset=−1 and returns the value of 0, indicating that h0 is the next edge to which the deblocking filter is to be applied. The instruction is then invoked (during the next execution) with offset=0 (the next bit position) to return the value of 3, indicating that h3 is the following edge to which the deblocking filter is to be applied. Similarly, the instruction is invoked in subsequent executions (with the offset equaling the bit position of the edge determined in a previous invocation) to identify the edges to which the deblocking filter is to be applied.
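Merely for illustration, the behavior of such an offset-based instruction may be emulated in C as below. The function name is hypothetical, bit position 0 is assumed to be the left most (most significant) bit, and the return value of −1 for "no further edge" is a convention of this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the first bit position after `offset` at which the bit equals 1,
 * or -1 when no such position exists. Position 0 is the left most bit. */
int next_edge(uint32_t edge_bits, int offset)
{
    assert(offset >= -1 && offset < 32);
    for (int pos = offset + 1; pos < 32; pos++) {
        if (edge_bits & (0x80000000u >> pos))
            return pos;
    }
    return -1;
}
```

For a bit field whose first four bits are “1001”, invoking next_edge with offsets −1 and 0 yields positions 0 and 3 respectively, matching the sequence h0, h3 described above.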

In an alternative embodiment, the identification of the edge to be processed is performed by shifting the bit field (loaded in the register) by a number of positions determined by the bit position at which the next bit is present in the bit field.

The number of positions to be shifted may be determined by a specific instruction such as “LMBD” (left most bit detect) which receives a bit field and a flag indicating the bit value to be detected as inputs and provides the position (from the left most bit) of the next bit in the bit field having the bit value indicated by the flag.

Accordingly, when the LMBD instruction is invoked with bit field 720 and a flag “1” (indicating that the deblocking filter is to be applied) as inputs, the value of “0” is generated as the output indicating that the deblocking filter is to be applied to the edge h0. The output value is stored in a register (acting as an accumulator) for future use.

The bit field is then shifted left by 1 position (1 more than the output value), resulting in the “1” bit at bit position 0 being removed and each of the bits in the other bit positions being moved one position to the left. Thus, the shifted bit field contains “0010” in the bit positions 0-3.

During the next execution of step 660, the LMBD instruction is again invoked with the shifted bit field and the flag as inputs to generate the value of “2” as the output. The (output value + 1) is added to the value in the accumulator to derive the position of the next edge to be filtered. In this case, the value “3” in the accumulator indicates that the deblocking filter is to be applied to the edge h3 (bit position 3 in bit field 720). The shifted bit field is again shifted by 3 positions (1 more than the output value) and the process is repeated for identifying the remaining edges to which the deblocking filter is to be applied.
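Merely for illustration, the shift-and-accumulate approach may be sketched in C as below. The function lmbd1 is a software stand-in for the LMBD instruction with the flag fixed at “1”, and the names edge_scan and scan_next are hypothetical. The sketch initializes the accumulator to −1 so that the same update (adding the output value + 1) also covers the first invocation, and it returns −1 once no edges remain; both are conventions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for LMBD with flag "1": position, counted from the
 * left most bit, of the first 1 bit; 32 when the word holds no 1 bit. */
int lmbd1(uint32_t v)
{
    int pos = 0;
    while (pos < 32 && !(v & 0x80000000u)) {
        v <<= 1;
        pos++;
    }
    return pos;
}

typedef struct {
    uint32_t bits;   /* remaining (already shifted) bit field          */
    int position;    /* accumulator: last edge found, -1 before first  */
} edge_scan;

/* Returns the bit position (in the original bit field) of the next edge
 * to be filtered, or -1 when none remain. */
int scan_next(edge_scan *s)
{
    assert(s != NULL);
    if (s->bits == 0)
        return -1;
    int out = lmbd1(s->bits);       /* at most 31 since bits != 0       */
    s->position += out + 1;         /* accumulate (output value + 1)    */
    /* Shift by out + 1 in two steps: a single shift by 32 would be
     * undefined in C for a 32-bit operand. */
    s->bits = (s->bits << out) << 1;
    return s->position;
}
```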

In step 670, CPU 410 determines the memory locations at which the values of the pixels required to filter the edge are stored using the bit-position in the bit-field for the edge to be filtered. The determination of the specific memory locations may be performed in a known way.

In one embodiment, a variable named ‘edge counter’ indicating the specific edge being processed is maintained. The edge counter indicates the number of bit positions from a first bit to a bit representing the specific/present edge to which a deblocking filter is to be applied.

Referring to FIG. 7, the edge counter is initially set to the value 0 (since the bit position 0 corresponding to edge h0 has a bit value of “1”). During the next execution of step 670, the edge counter is set to the value 3, the bit position of the following edge h3. The value of the edge counter may be returned by an instruction or may be maintained as a sum/accumulator of the values returned by the instruction (for example, LMBD described above).

The value of the edge counter may then be used to determine the memory locations. In one embodiment, a lookup table is maintained indicating the memory locations at which the values of the pixels for each of the horizontal/vertical edge are stored. The lookup table is indexed based on the edge counter, thereby facilitating the determination of the specific memory locations.

Alternatively, the memory locations may be computed based on specific edge being processed (as indicated by the value of the edge counter) in combination with the offset locations at which the macro-block is stored, the size of each memory location, the boundary strength, etc.
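Merely for illustration, both alternatives (computation and lookup table) may be sketched in C as below. The sketch rests on several assumptions not stated in this description: the macro-block is stored row by row with `stride` values per row, one value per pixel; vertical edges v0-v3 lie on the left macro-block boundary from top to bottom, v4-v7 on the next sub-block column, and so on; and horizontal edges are numbered analogously with h0-h3 on the top boundary. All names are hypothetical. The returned offset is that of the first pixel on the q side of the edge, relative to the start of the macro-block.

```c
#include <assert.h>
#include <stddef.h>

#define SUB_BLOCK 4   /* sub-blocks are 4x4 pixels */
#define NUM_EDGES 16  /* edges per orientation     */

/* Offset for vertical edge `edge` (assumed numbering: column-major). */
size_t vertical_edge_offset(int edge, size_t stride)
{
    assert(edge >= 0 && edge < NUM_EDGES);
    size_t x = SUB_BLOCK * (size_t)(edge / 4);  /* column of the edge  */
    size_t y = SUB_BLOCK * (size_t)(edge % 4);  /* first row it spans  */
    return y * stride + x;
}

/* Offset for horizontal edge `edge` (assumed numbering: row-major). */
size_t horizontal_edge_offset(int edge, size_t stride)
{
    assert(edge >= 0 && edge < NUM_EDGES);
    size_t x = SUB_BLOCK * (size_t)(edge % 4);  /* first column it spans */
    size_t y = SUB_BLOCK * (size_t)(edge / 4);  /* row of the edge       */
    return y * stride + x;
}

/* Fills a lookup table indexed by the edge counter. */
void fill_offset_table(size_t table[NUM_EDGES], size_t stride, int vertical)
{
    for (int e = 0; e < NUM_EDGES; e++)
        table[e] = vertical ? vertical_edge_offset(e, stride)
                            : horizontal_edge_offset(e, stride);
}
```

Filling the table once per frame geometry replaces the division and multiplication in the inner loop with a single indexed load.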

In step 680, CPU 410 applies the deblocking filter to the edge to cause the values in the memory locations to be modified to corresponding new values. The application of the deblocking filter may involve loading the pixel values from the computed memory locations, the performance of the filter operations to generate new values (not described for conciseness) and storing the new values to the corresponding memory locations.

In step 690, CPU 410 decrements the loop count by 1 indicating that the edge has been processed. Control then passes to step 640, where the loop count value is checked to determine whether more edges are to be processed.
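Merely for illustration, steps 620 through 690 may be combined into a single loop as sketched in C below. The helper and type names are hypothetical; count_ones and next_one are software stand-ins for the bit-count and bit-search instructions, bit position 0 is assumed to be the left most bit, and the callback stands for steps 670 and 680 (locating the pixels and applying the filter), which are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback covering steps 670-680 for one edge. */
typedef void (*edge_filter_fn)(int edge, void *ctx);

/* Stand-in for a bit-count instruction. */
int count_ones(uint32_t bits)
{
    int n = 0;
    for (; bits != 0; bits &= bits - 1)
        n++;
    return n;
}

/* Stand-in for a "next 1 bit after offset" instruction (position 0 is
 * the left most bit); returns -1 when no further 1 bit exists. */
int next_one(uint32_t bits, int offset)
{
    for (int pos = offset + 1; pos < 32; pos++)
        if (bits & (0x80000000u >> pos))
            return pos;
    return -1;
}

/* Runs the loop of FIG. 6 and returns the number of edges processed. */
int process_edges(uint32_t edge_bits, edge_filter_fn apply, void *ctx)
{
    int loop_count = count_ones(edge_bits);   /* step 620 */
    int processed = 0;
    int edge = -1;
    while (loop_count > 0) {                  /* step 640 */
        edge = next_one(edge_bits, edge);     /* step 660 */
        assert(edge >= 0);
        apply(edge, ctx);                     /* steps 670-680 */
        loop_count--;                         /* step 690 */
        processed++;
    }
    return processed;
}

/* Example callback used only to observe the visiting order. */
typedef struct { int order[32]; int n; } visit_log;

void record_edge(int edge, void *ctx)
{
    visit_log *log = (visit_log *)ctx;
    log->order[log->n++] = edge;
}
```

The loop executes exactly once per edge to be filtered, and its body contains no test of whether the current edge needs filtering.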

It may be appreciated that the above steps provide enhanced parallelism in a scenario that the processing of the different horizontal/vertical edges can be performed independently. It may be desirable that the processing of the edges be performed with maximum parallelism even in a scenario that dependencies exist among the edges.

An aspect of the present invention enables dependent edges to be processed while providing enhanced parallelism, thereby improving the throughput performance when applying deblocking filters on reconstructed image frames. It may be helpful to first understand the manner in which dependencies exist among the edges, and accordingly the description is continued illustrating the dependencies existing in the application of the deblocking filter to a macro-block in one embodiment.

11. Dependencies in Applying Deblocking Filter

FIGS. 8A and 8B together illustrate the dependencies in the application of deblocking filter to the edges of a macro-block in one embodiment. Each of the Figures is described in detail below.

FIG. 8A depicts the dependencies in processing vertical edges in one embodiment. In particular the Figure depicts the dependency between the vertical edges v0 (between sub-blocks 315 and 312) and v4 (between sub-blocks 312 and 318). In one scenario, when the vertical edges v0 and v4 are determined to have respective boundary strengths of 3 and 2, the pixels {p2, p1, p0, q0, q1, q2} represent the inputs to the deblocking filter for edge v0, while the pixels {r1, r0, s1, s0} represent the inputs for edge v4.

It may be observed that pixels r1 and q2 refer to the same pixel (shown as the multiple value “q2/r1” in the corresponding box), indicating that the new value of pixel q2 is to be used as the value of r1 in processing the vertical edge v4. As such, it may be necessary that the application of the deblocking filter to edge v4 be performed after the processing of edge v0 (at least until the new values of the common pixels are generated).

Accordingly, the edge v4 (dependent edge) is said to have a dependency on edge v0 (base edge). The dependency is applicable to each of the horizontal/vertical edges in the reconstructed macro-block. For macro-block edges such as v0 and h0, the dependency may be with respect to an edge in another macro-block.

Similarly, FIG. 8B depicts the dependencies in processing horizontal edges of a macro-block in one embodiment. In particular, the Figure indicates that the processing of edge h8 (between sub-blocks 325 and 328) requires the new values of the rows of pixels n1 and n2 (shown as “n1/j2” and “n2/j1”), and therefore it may be necessary that the application of the deblocking filter to edge h8 be performed after the processing of edge h4 (between sub-blocks 322 and 325).

It may be appreciated that since each sub-block is of size 4×4 pixels, such dependencies often occur when a pair of edges (in the same orientation) is determined to be filtered using boundary strengths greater than 2. In the H.264 standard, dependencies are common while processing the luma information, since for chroma information the maximum boundary strength used is 2, which causes no overlapping pixels between pairs of edges.
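The overlap condition can be sketched in a few lines of Python, used here purely for illustration. The sketch assumes, as a reading of the description above, that the boundary strength of an edge equals the number of pixels taken from each side of that edge, and that positions inside the shared 4-pixel-wide sub-block are counted from the base edge. The function name `common_pixel_positions` is hypothetical.

```python
SUB_BLOCK_SIZE = 4  # each sub-block is 4x4 pixels, per the description


def common_pixel_positions(base_strength, dependent_strength):
    """Positions (0..3, counted from the base edge) used by both edges.

    Assumption: a boundary strength of n means n pixels on each side of
    the edge are inputs to the deblocking filter.
    """
    # The base edge reads q0, q1, ... which sit at positions 0, 1, ...
    base_inputs = set(range(base_strength))
    # The dependent edge reads r0, r1, ... which sit at positions 3, 2, ...
    dependent_inputs = {SUB_BLOCK_SIZE - 1 - i for i in range(dependent_strength)}
    return sorted(base_inputs & dependent_inputs)
```

Under these assumptions, strengths of 3 and 2 share only position 2 (the pixel q2/r1 of FIG. 8A), strengths of 3 and 3 share positions 1 and 2 (the rows n1/j2 and n2/j1 of FIG. 8B), and strengths of 2 and 2 share no pixel.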

Accordingly, the various aspects of the present invention are described with respect to processing of luma information. However, the features described below can be implemented for luma and chroma information as well when encoding/decoding image frames in other contexts and/or other standards as will be apparent to one skilled in the relevant arts by reading the disclosure herein.

It may be noted that each edge (horizontal or vertical) may have a dependency on only one other corresponding base edge, at least based on the sequence of processing the edges. In one embodiment, when the edges are numbered in a sequence (as depicted in FIG. 3A), the base edge is determined by subtracting 4 from the sequence number of the dependent edge (with a value less than 0 indicating that the base edge occurs in another macro-block). For example, for the horizontal edge h15, the base edge can be calculated to be h11 (15-4).
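The base-edge rule can be illustrated with a minimal Python sketch. The function name `base_edge_number` and the use of `None` to signal a base edge lying in another macro-block are assumptions made only for this illustration.

```python
def base_edge_number(edge_number):
    """Return the sequence number of the base edge, or None when the
    base edge lies in another macro-block (result below 0)."""
    base = edge_number - 4
    return base if base >= 0 else None
```

For example, `base_edge_number(15)` gives 11 and `base_edge_number(8)` gives 4, while `base_edge_number(0)` gives `None` because the base edge of a macro-block edge is in a neighboring macro-block.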

It may be appreciated that the edge dependencies described above may cause memory dependencies while processing the edges in a reconstructed macro-block. In one embodiment, the luma information corresponding to each sub-block is stored as 4 words in a memory, with each word containing 4 bytes, each byte storing the luma information corresponding to a pixel. Thus, in FIG. 3B, the 4 words represent the corresponding 4 rows of pixels {p3, p2, p1, p0} or the corresponding rows of pixels {m3}, {m2}, {m1} and {m0}.

In such a scenario, a dependency between horizontal edges causes a memory dependency of at least one memory location (rows of pixels {n2/j1} in the example above), while a dependency between vertical edges causes a memory dependency for at least 4 memory locations (4 rows of pixels {q0, q1, q2/r1}).
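The difference between the two cases follows from the row-per-word layout, and may be sketched as follows. The byte order inside a word (pixel 0 in the low byte) and the helper names `pack_row` and `words_touched` are assumptions for illustration only.

```python
def pack_row(pixels):
    """Pack four 8-bit pixel values into one 32-bit word.

    Assumption: pixel 0 occupies the least significant byte.
    """
    word = 0
    for index, value in enumerate(pixels):
        word |= (value & 0xFF) << (8 * index)
    return word


def words_touched(pixel_coordinates):
    """Indices of the words (one word per row) that hold the given
    (row, column) pixels of a 4x4 sub-block."""
    return sorted({row for row, _column in pixel_coordinates})


# Common pixels for a horizontal-edge dependency: one full row.
shared_row = [(2, column) for column in range(4)]
# Common pixels for a vertical-edge dependency: one full column.
shared_column = [(row, 2) for row in range(4)]
```

Here `words_touched(shared_row)` names a single word, whereas `words_touched(shared_column)` names all four words of the sub-block, which is why a vertical-edge dependency spans at least 4 memory locations.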

It may be desirable that such memory dependencies be reduced (or removed) to enhance the parallelism in the processing of the edges, thereby improving the performance of the deblocking filter.

Various aspects of the present invention enable such memory dependencies to be reduced when processing horizontal/vertical edges of a reconstructed macro-block. The description is continued illustrating the manner in which memory dependencies are reduced when processing edges (in one orientation) in the reconstructed macro-block.

12. Processing Edges in One Orientation

FIG. 9 is a flowchart illustrating the manner in which memory dependencies in processing the edges (in one orientation) of a reconstructed block are reduced according to an aspect of the present invention.

The description is continued assuming that the horizontal edges of the reconstructed macro-block are being processed and accordingly the flowchart is described with respect to FIGS. 4A, 4B, and 8B, merely for illustration.

Further, it is assumed for convenience that the processing of a horizontal edge is being performed by one of the execution units (415A) contained in CPU 410, though the processing of different edges may be performed by different execution units in parallel, at least as permitted by several aspects of the present invention described below. However, the various features can be implemented in other environments and other components.

Furthermore, the steps are described in a specific sequence merely for illustration. Alternative embodiments in other environments, using other components and different sequence of steps can also be implemented without departing from the scope and spirit of several aspects of the present invention, as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein. The flowchart starts in step 901, in which control passes immediately to step 910.

In step 910, execution unit 415A receives the memory locations storing the values of the pixels required to filter a horizontal (or present) edge of a reconstructed macro-block and also a bit field indicating the edges in the macro-block to which a deblocking filter is to be applied.

The received memory locations may correspond to the memory locations computed in step 670, while the received bit field may correspond to the bit field generated in step 610 (an example of which is shown in FIG. 7). As described above, the bit field indicates the specific edges to which a deblocking filter is to be applied (in one embodiment, by corresponding bit values of “1”).

The memory locations and the bit field may be received from another execution unit (or a scheduling unit not shown) which identifies the horizontal edge to be processed. The description is continued assuming that the deblocking filter is to be applied to horizontal edge h8.

In step 930, execution unit 415A reads/loads an input set of values from the corresponding memory locations (received in step 910). The reading/loading and writing/storing of the values from/to the memory locations may be performed in a known way. The input set of values may be read into a set of registers provided in data registers 413.

The number of values to be read/loaded may be determined based on the boundary strength (which indicates the number of pixels to be used as inputs to the deblocking filter). In one embodiment, the input set of values corresponding to the two sub-blocks (forming the edge) is retrieved in the form of 1-8 32-bit words (1-4 words per sub-block) from respective memory locations in memory 480.

Thus, while processing edge h8 and assuming a boundary strength of 3, the 3 words corresponding to the rows of pixels {j2}, {j1}, and {j0} in sub-block 325 and the 3 words corresponding to the rows of pixels {k2}, {k1}, and {k0} in sub-block 328 are read into a set of registers provided in data registers 413. For convenience, the registers are named j2_3210, j1_3210, j0_3210, k0_3210, k1_3210, and k2_3210, with the name indicating the pixels stored in the corresponding register.
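As a small illustration of how the boundary strength determines the input set, the following Python sketch lists the register names for a given strength. The function name `input_register_names` and its parameters are hypothetical; only the naming pattern (row label followed by `_3210`) is taken from the description.

```python
def input_register_names(strength, near="j", far="k"):
    """Names of the registers holding the input rows on both sides of
    an edge, ordered from the farthest near-side row to the farthest
    far-side row. One 32-bit word (4 pixels) is assumed per register."""
    near_side = [f"{near}{i}_3210" for i in reversed(range(strength))]
    far_side = [f"{far}{i}_3210" for i in range(strength)]
    return near_side + far_side
```

With a strength of 3 the sketch yields the six names used above, and with a strength of 1 it yields only `j0_3210` and `k0_3210`.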

In step 940, execution unit 415A checks whether the bit field (received in step 910) indicates that the deblocking filter is to be applied to the base edge (the edge on which the horizontal edge h8 depends, as described above with respect to FIG. 8B).

Execution unit 415A first determines the base edge corresponding to the horizontal edge in a convenient/suitable manner. In one embodiment described above, the base edge is determined by subtracting 4 from the sequence number of the horizontal edge. Thus, for the horizontal edge h8, the base edge is calculated to be h4 (8-4).

Execution unit 415A then checks whether the bit field indicates that the deblocking filter is to be applied to the base edge by inspecting the value of the bit corresponding to the base edge in the bit field. Control passes to step 950 if the bit has a value of “1” (indicating that the deblocking filter is to be applied to the base edge) and to step 960 otherwise.
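The check of step 940 may be sketched as below. The sketch assumes that bit i of the bit field (counting from the least significant bit) corresponds to edge number i, which is an illustrative choice and may differ from the layout of FIG. 7. It also handles only base edges inside the same macro-block; the names `is_edge_filtered` and `next_step_after_940` are hypothetical.

```python
def is_edge_filtered(bit_field, edge_number):
    """True when the bit for the given edge is 1.

    Assumption: bit i (from the least significant bit) maps to edge i.
    """
    return ((bit_field >> edge_number) & 1) == 1


def next_step_after_940(bit_field, present_edge):
    """Return 950 when the base edge is to be filtered, else 960."""
    base_edge = present_edge - 4
    if base_edge < 0:
        # A base edge in another macro-block is outside this sketch.
        raise ValueError("base edge lies in another macro-block")
    return 950 if is_edge_filtered(bit_field, base_edge) else 960
```

For example, with a bit field in which the bits for h4 and h8 are both 1, processing h8 proceeds to step 950; if only the bit for h8 is 1, processing proceeds directly to step 960.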

In step 950, execution unit 415A replaces the dependent values in the input set with corresponding values from a buffer. The dependent values may be determined based on the pixels that are common to both the horizontal edge and the base edge. The values in the buffer represent the new values corresponding to the common pixels.

In the above example, assuming that the base edge h4 is filtered using a boundary strength of 3 (as depicted in FIG. 8B), the dependent values may be determined to be the values corresponding to the rows of pixels {n1/j2} and {n2/j1}. Thus, the new values of the rows of pixels {n1} and {n2} (generated by applying the deblocking filter to the base edge h4) may be retrieved from the buffer and used to respectively replace the values in the registers j2_3210 and j1_3210 (respectively storing the old values of the rows of pixels {j2} and {j1}).

It may be appreciated that the buffer may contain the new values of only the common pixels (determined based on the dependency among the edges). The buffer may be provided in data cache 417 instead of memory (such as RAM 420), thereby increasing the speed of access to the data. Control then passes to step 960.
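The replacement of step 950 may be pictured with the following Python sketch, in which registers and the buffer are modeled as dictionaries. Keying the buffer by the name of the register to be overwritten is an assumption made for brevity, and the function name `replace_dependent_values` is hypothetical.

```python
def replace_dependent_values(input_set, buffer):
    """Return a copy of the input set in which every register that has
    a newer value in the buffer takes that value.

    input_set: dict mapping register name -> 32-bit word loaded from memory
    buffer:    dict mapping register name -> new word produced by the
               base edge (only the common rows are present)
    """
    updated = dict(input_set)
    for register_name, new_word in buffer.items():
        if register_name in updated:
            updated[register_name] = new_word
    return updated
```

For edge h8 with base edge h4 filtered at a strength of 3, the buffer would hold two entries (the new rows {n1} and {n2}), so only j2_3210 and j1_3210 change while j0_3210 and the k registers keep the loaded values.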

In step 960, execution unit 415A performs filter operations (as part of applying the deblocking filter) using the input set to generate a corresponding output set of values, the output set representing the set of pixels after filtering. The output set of values may be generated and stored in another set of registers, conveniently named, j2_3210′, j1_3210′, j0_3210′, k0_3210′, k1_3210′, and k2_3210′ with the name indicating the pixels stored in the corresponding register.

It may be observed that in a scenario that the base edge is determined to be filtered, the input set used in performing the filter operations contains the new values of the common/dependent pixels (replaced in step 950) in conformance to the H.264 standard. Alternatively, in a scenario that the base edge is determined to be not filtered (bit in the mask=0), the loaded values are used as the input set in performing the filter operations.

As described above, the specific set of filter operations used to generate the output values is not described for conciseness. Further, though the output values are assumed to be generated in a different set of registers (having names similar to the input set of registers), the techniques described herein can also be applied when the output values are generated in data cache and/or memory.

In step 970, execution unit 415A computes a set of differences (according to Equation 1 noted above) using the input set of values. The computed differences are compared with the respective threshold values described above with respect to FIG. 3C. As described above, each of the set of differences is computed based on any new values of the common/dependent pixels according to the H.264 standard.

The set of differences is computed for each of the sets of pixels forming the edge. Thus, for horizontal edge h8, the set of differences is computed for each of the sets of pixels {j2, j1, j0, k0, k1, k2} by substituting the values of j0, j1, k0, k1 respectively for p0, p1, q0 and q1 in Equation 1. The set of differences is then used to determine the values to be written/stored in the memory locations as described below.

Though the computation of the set of differences is shown as being performed by execution unit 415A, it may be appreciated that the computation may be performed by another execution unit (such as 415B) in parallel with the performance of the filter operations in step 960, thereby improving the throughput performance. Such parallel performance of steps 960 and 970 is facilitated by the replacement of the dependent values from a buffer in step 950.

In step 980, execution unit 415A stores the output or input set of values to the buffer based on comparison results of the set of differences with respective threshold values according to Equation 1. The output set of values are stored in the buffer if the comparison satisfies the threshold requirements shown in Equation 1 and the input set of values are stored in the buffer otherwise.

It may be appreciated that the values (representing the new values of the pixels) stored in the buffer may later be used in step 950 when the deblocking filter is applied to the horizontal edge h12. As described above, only the dependent/common values in the output set may be stored in the buffer for convenience.
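The selection made in steps 970 and 980 may be sketched as follows. Equation 1 is not reproduced here; the sketch assumes it has the form of the sample-level condition of the H.264 standard, namely |p0 - q0| < alpha, |p1 - p0| < beta and |q1 - q0| < beta. The function names and the representation of each set of pixels as a tuple (j2, j1, j0, k0, k1, k2) are hypothetical.

```python
def passes_thresholds(p1, p0, q0, q1, alpha, beta):
    """Assumed form of Equation 1 (H.264 sample-level condition)."""
    return (abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def select_values(input_sets, output_sets, alpha, beta):
    """For each set of pixels (j2, j1, j0, k0, k1, k2), keep the
    filtered output when the thresholds are met, else the input."""
    selected = []
    for inputs, outputs in zip(input_sets, output_sets):
        _j2, j1, j0, k0, k1, _k2 = inputs
        # j0, j1, k0, k1 stand in for p0, p1, q0, q1 respectively.
        if passes_thresholds(j1, j0, k0, k1, alpha, beta):
            selected.append(outputs)
        else:
            selected.append(inputs)
    return selected
```

Because the differences depend only on the input set, this selection can be evaluated independently of the filter operations of step 960, consistent with the parallel execution noted above.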

In step 990, execution unit 415A writes the output or input set of values (from the corresponding set of registers) to the corresponding memory locations based on the comparison results noted above. The memory locations may be the same memory locations from which the input set of values was read in step 920, in which case only the output set of values needs to be written and the input set of values need not be written back. Alternatively, the output or input set of values may be written to a corresponding set of memory locations where the post-filtered/displayed image frame is to be generated.

In one embodiment, a set of mutually exclusive conditional-store instructions are used to write the output/input set of values to the memory locations. Each conditional-store instruction receives as inputs a value to be written, a memory location at which the value is to be written and a condition. On execution, the value is written to the memory location only when the condition is fulfilled.

An output/input value is then written to a memory location by having two conditional-store instructions in tandem, whereby the threshold is provided as the condition of the instruction for writing the output value while the negation of the threshold is provided as condition of the other instruction for writing the input value. Accordingly, on execution of the tandem conditional-store instructions, the output value is written to the memory location when the threshold is fulfilled and the input value is written otherwise.

A set of tandem conditional-store instructions may be used to write the output or input set of values to the corresponding memory locations. In the above example, the storage of the output or input set of values is performed by four tandem conditional-store instructions (one for each of the sets of pixels {j2, j1, j0, k0, k1, k2}), with each tandem conditional-store instruction containing an instruction for storing the value in a corresponding output register (such as j2_3210′) and another instruction for storing the value in a corresponding input register (such as j2_3210).

It may be appreciated that in a scenario in which the output set of values is to be written to the same memory locations (from which the input set of values was read in step 920), the conditional-store instruction corresponding to the non-fulfillment of the threshold (for writing the input value) need not be executed.
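The behavior of a tandem pair of conditional-store instructions may be modeled in Python as shown below. Memory is modeled as a dictionary, and the names `conditional_store` and `tandem_store`, as well as the `in_place` flag, are hypothetical; an actual processor would issue predicated store instructions instead of function calls.

```python
def conditional_store(memory, address, value, condition):
    """Write value to memory[address] only when condition holds."""
    if condition:
        memory[address] = value


def tandem_store(memory, address, output_value, input_value,
                 threshold_met, in_place=False):
    """Two mutually exclusive conditional stores for one location.

    When in_place is True the location already holds the input value,
    so the second store (for the unfiltered value) is skipped.
    """
    conditional_store(memory, address, output_value, threshold_met)
    if not in_place:
        conditional_store(memory, address, input_value, not threshold_met)
```

Exactly one of the two conditions is true for any given location, so the pair never writes both values, and no branch is needed to choose between them.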

The storage of the input set of values in the memory locations/buffer (instead of the output set of values) indicates that the deblocking filter has not been applied to the corresponding set of pixels of the present edge being processed. The flow chart ends in step 999.

It may be appreciated that determining from the bit field whether a base edge is being filtered enables the application of deblocking filters to at least some of the edges in the macro-block in parallel (without necessitating waiting for the processing of the base edge to be completed).

Further, by storing the dependent values (determined based on the dependency among the edges) in a buffer in data cache 417 (faster than memory), the memory dependencies among the horizontal edges are reduced, further improving the throughput performance.

It may be observed that the above steps for the application of a deblocking filter are related to processing of edges in one orientation (horizontal or vertical). However, for the other orientation (vertical in case of horizontal and vice versa) the above steps may be modified based on the information contained in the patent document titled “Loop Deblock Filtering Of Block Coded Video In a Very Long Instruction Word Processor” by Jagadeesh Sankaran with publication number US 2005/0117653 available from US patent office.

In particular, the reader is directed to FIGS. 15 and 16 in the above noted patent document which illustrate a manner in which edges in the other orientation can be processed. Accordingly, the steps of FIG. 9 described above may be modified to exploit the transpose feature noted in the patent document, as described below with examples.

13. Processing Edges in the Other Orientation

Assuming that the deblocking filter is to be applied to vertical edge v4 (and referring to FIG. 8A), the input set of values is read into a set of registers named s1_s0_r0_r1_0, s1_s0_r0_r1_1, s1_s0_r0_r1_2, and s1_s0_r0_r1_3, the name indicating the pixels stored in each register and the last number indicating the corresponding row. It may be observed that the corresponding values of the rows of pixels {r1} (having a dependency on the vertical edge v0) are read into each of the 4 registers, indicating that there are 4 memory dependencies.

The input values in the registers are then transposed and stored in the same/different registers conveniently named s1_3210, s0_3210, r0_3210, and r1_3210. It may be observed that by transposing the input set of values, the rows of pixels {r1} are stored in a corresponding register r1_3210, indicating that the number of memory dependencies has been reduced from 4 to 1.
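The effect of the transpose may be illustrated with the following Python sketch, which treats each register as a 32-bit word holding 4 pixel bytes. The byte order (the last-named pixel, r1, in the low byte) and the helper names are assumptions for illustration; the sketch does not reproduce the transpose instructions of any particular processor.

```python
def unpack_word(word):
    """Split a 32-bit word into 4 bytes, low byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack_word(values):
    """Combine 4 bytes into a 32-bit word, low byte first."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFF) << (8 * i)
    return word


def transpose_4x4(words):
    """Transpose a 4x4 byte matrix held as four 32-bit words."""
    rows = [unpack_word(word) for word in words]
    return [pack_word([rows[r][c] for r in range(4)]) for c in range(4)]
```

With four input words each holding (r1, r0, s0, s1) for one row, the first word of the transposed result holds the four r1 values and thus plays the role of r1_3210. Applying `transpose_4x4` a second time restores the original row-wise layout.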

The filter operations are then performed using the transposed input set of values to generate a corresponding transposed output set of values in the set of registers. The filter operations may be performed on the transposed input set loaded from memory or on the input set containing the new values replaced from a buffer, if the base edge is determined to be filtered based on the bit field.

The transposed output set of values (representing the new values of the pixels) are then stored in the buffer and may later be used when the deblocking filter is applied to the vertical edge v8. As described above, only the dependent values in the transposed output set may be stored in the buffer for convenience.

The transposed output set of values is then transposed to generate an output set of values in the set of registers, the output set representing the values of the pixels after filtering. Thus, the transposed output set of values in the registers s1_3210, s0_3210, r0_3210, r1_3210 are transposed to generate the output set of values in the same/different registers conveniently named s1_s0_r0_r1_0, s1_s0_r0_r1_1, s1_s0_r0_r1_2, and s1_s0_r0_r1_3 similar to the registers into which the input set of values were read. The output set of values in the set of registers is then written to the corresponding set of memory locations.

It may be observed that the transposed set of output values (generated by application of the deblocking filter) are maintained in the buffer instead of the output set (generated after transposing), thereby improving the performance of the application of the deblocking filter to the edges in the other orientation.

14. Conclusion

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

It should be understood that the figures and/or screen shots illustrated in the attachments highlighting the functionality and advantages of the present invention are presented for example purposes only. The present invention is sufficiently flexible and configurable, such that it may be utilized in ways other than that shown in the accompanying figures.

Further, the purpose of the following Abstract is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The Abstract is not intended to be limiting as to the scope of the present invention in any way.

1. A machine readable medium carrying one or more sequences of instructions for causing a system to process image frames in encoded format, wherein execution of said one or more sequences of instructions by one or more processors contained in said system causes said system to perform the actions of: receiving a first plurality of values representing an image frame in encoded format, said image frame containing a plurality of macro-blocks, each of said macro-blocks in turn containing a plurality of sub-blocks, a plurality of horizontal edges and a plurality of vertical edges being formed by said plurality of sub-blocks, said plurality of horizontal edges and said plurality of vertical edges including pairs of edges of same orientation; decoding said first plurality of values to form a second plurality of values representing a reconstruction of said image frame in a decoded format; determining the specific ones of said pair of edges to which a deblocking filter is to be applied by evaluating a set of pre-conditions that need to be satisfied according to a standard; and applying said deblocking filter to the determined specific ones of said pair of edges, wherein said applying is performed after said determining.
 2. The machine readable medium of claim 1, wherein each of said pair of edges are adjacent to each other.
 3. The machine readable medium of claim 2, wherein said determining is performed for all edges in one orientation before performing said applying.
 4. The machine readable medium of claim 1, further comprising one or more instructions for: forming a bit field containing a set of bits, with each bit indicating whether said deblocking filter is to be applied to a corresponding edge.
 5. The machine readable medium of claim 4, further comprising one or more instructions for: loading said bit field into a register; and identifying a next bit starting from a first bit in said register, wherein said next bit indicates a next edge to which said deblocking filter is to be applied, wherein said identifying also identifies a following bit starting from said next bit, wherein said following bit indicates a following edge after said next edge to which said deblocking filter is to be applied.
 6. The machine readable medium of claim 5, wherein said identifying comprises: using an instruction which receives an offset as an input and indicates in said register a next bit position starting from said offset at which the corresponding bit equals a desired binary value, wherein said identifying identifies said next bit by invoking said instruction with said offset equal to the bit position of said first bit and then identifies said following bit by invoking said instruction with said offset equaling the bit position of said next bit in said bit field loaded into said register.
 7. The machine readable medium of claim 5, wherein said identifying comprises shifting said bit field in said register by a number of positions determined by the bit position at which said next bit is present in said bit field when loaded into said register.
 8. The machine readable medium of claim 5, further comprising determining a number of bits in said bit field indicating that deblocking filter is to be applied to corresponding edges, wherein said identifying identifies each present edge to which deblocking filter is to be applied in a corresponding loop, wherein said loop is executed said number of times.
 9. The machine readable medium of claim 8, further comprising one or more instructions for: maintaining an edge counter which indicates the number of bit positions from said first bit to a bit representing said present edge; and determining a first set of addresses of memory locations storing the specific ones of said second plurality of values which are required to apply said deblocking filter to said present edge based on said edge counter.
 10. The machine readable medium of claim 9, further comprising one or more instructions for: maintaining a lookup table indicating the addresses of memory locations storing said second plurality of values which are required to apply said deblocking filter corresponding to each of said plurality of horizontal edges and each of said plurality of vertical edges, wherein said lookup table is indexed based on said edge counter, wherein said determining determines said first set of addresses corresponding to said present edge based on said edge counter and said lookup table.
 11. The machine readable medium of claim 4, wherein said second plurality of values are stored in a plurality of memory locations of a memory, wherein said bit field indicates that deblocking filter is to be applied to a present edge, wherein said present edge requires values at a set of memory locations contained in said plurality of locations as inputs to said deblocking filter, further comprising one or more instructions for: loading the values from said set of memory locations into a set of registers; checking whether said bit field indicates that said deblocking filter is to be applied to a base edge corresponding to said present edge, wherein application of said deblocking filter to said base edge causes at least some of the values in said set of memory locations to be modified to corresponding new values; applying said deblocking filter to said present edge using said values in said set of registers if said bit field indicates that said deblocking filter is not to be applied to said base edge; and waiting for availability of said new values before applying said deblocking filter to said present edge if said bit field indicates that said deblocking filter is to be applied to said base edge.
 12. The machine readable medium of claim 11, further comprising one or more instructions for: storing said new values in a buffer, which provides faster access than said memory; replacing the values in said set of registers using said new values in said buffer after said waiting; and applying said deblocking filter to said present edge using the replaced values in said set of registers.
 13. A method of processing image frames in encoded format, said method comprising: receiving a first plurality of values representing an image frame in encoded format, said image frame containing a plurality of macro-blocks, each of said macro-blocks in turn containing a plurality of sub-blocks, a plurality of horizontal edges and a plurality of vertical edges being formed by said plurality of sub-blocks, said plurality of horizontal edges and said plurality of vertical edges including a pair of adjacent edges of same orientation; decoding said first plurality of values to form a second plurality of values representing said image frame in a decoded format; determining the specific ones of said pair of adjacent edges to which a deblocking filter is to be applied by evaluating any pre-conditions that need to be satisfied according to a standard; and applying said deblocking filter to the determined specific ones of said pair of adjacent edges, wherein said determining is performed for all edges in one orientation before performing said applying.
 14. The method of claim 13, further comprising forming a bit field containing a set of bits, with each bit indicating whether said deblocking filter is to be applied to a corresponding edge.
 15. The method of claim 14, further comprising: loading said bit field into a register; and identifying a next bit starting from a first bit in said register, wherein said next bit indicates a next edge to which said deblocking filter is to be applied, wherein said identifying also identifies a following bit starting from said next bit, wherein said following bit indicates a following edge after said next edge to which said deblocking filter is to be applied.
 16. The method of claim 14, wherein said identifying comprises: using an instruction which receives an offset as an input and indicates in said register a next bit position starting from said offset at which the corresponding bit equals a desired binary value, wherein said identifying identifies said next bit by invoking said instruction with said offset equal to the bit position of said first bit and then identifies said following bit by invoking said instruction with said offset equaling the bit position of said next bit in said bit field loaded into said register.
 17. The method of claim 15, further comprising determining a number of bits in said bit field indicating that deblocking filter is to be applied to corresponding edges, wherein said identifying identifies each present edge to which deblocking filter is to be applied in a corresponding loop, wherein said loop is executed said number of times.
 18. The method of claim 17, further comprising: maintaining an edge counter which indicates the number of bit positions from said first bit to a bit representing said present edge; and computing addresses of memory locations storing the specific ones of said second plurality of values which are required to apply said deblocking filter to said present edge.
 19. The method of claim 14, wherein said second plurality of values are stored in a plurality of memory locations of a memory, wherein said bit field indicates that deblocking filter is to be applied to a present edge, wherein said present edge requires values at a set of memory locations contained in said plurality of locations as inputs to said deblocking filter, said method further comprising: loading the values from said set of memory locations into a set of registers; checking whether said bit field indicates that said deblocking filter is to be applied to a base edge corresponding to said present edge, wherein application of said deblocking filter to said base edge causes at least some of the values in said set of memory locations to be modified to corresponding new values; applying said deblocking filter to said present edge using said values in said set of registers if said bit field indicates that said deblocking filter is not to be applied to said base edge; and waiting for availability of said new values before applying said deblocking filter to said present edge if said bit field indicates that said deblocking filter is to be applied to said base edge.
 20. The method of claim 19, further comprising: storing said new values in a buffer, which provides faster access than said memory; replacing the values in said set of registers using said new values in said buffer after said waiting; and applying said deblocking filter to said present edge using the replaced values in said set of registers. 